Homogeneous Ensemble Learning in Highly Imbalanced Data
Name(s): Kaiwen Bian & Bella Wang
Website Link: https://kevinbian107.github.io/ensemble-imbalanced-data/
# for eda and modeling
import pandas as pd
import numpy as np
pd.options.plotting.backend = 'plotly'
from utils.dsc80_utils import *
from itertools import chain
Step 1: Introduction
A predictive model that detects user preference from textual features in combination with other numerical features is a key first step before building a recommender system or doing any further analysis. The main challenge addressed in this project is the highly imbalanced nature of the recipe dataset we are using.
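To make the imbalance concrete, here is a minimal sketch (using hypothetical counts, not the actual dataset) of how a skewed rating distribution shows up with `value_counts`:

```python
import pandas as pd

# hypothetical ratings illustrating the skew typical of recipe reviews:
# the vast majority of reviews are 5 stars
ratings = pd.Series([5] * 80 + [4] * 12 + [3] * 4 + [2] * 2 + [1] * 2)

# normalized class frequencies reveal the imbalance a classifier must handle
props = ratings.value_counts(normalize=True).sort_index()
print(props)
```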
interactions = pd.read_csv('food_data/RAW_interactions.csv')
recipes = pd.read_csv('food_data/RAW_recipes.csv')
Step 2: Data Cleaning and Exploratory Data Analysis
Merging
An initial merge is needed to combine the two datasets into one larger dataset:
- Left-merge the recipes and interactions datasets together.
- In the merged dataset, fill all ratings of 0 with np.nan. A rating of 0 typically means the reviewer left no rating at all, so treating it as missing avoids biasing averages downward.
- Find the average rating per recipe, as a Series.
- Merge this Series containing the average rating per recipe back into the recipes dataset; the resulting dataset is used for all subsequent analysis. (The `review` column in the interactions dataset is not heavily used in this part of the analysis.)
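A minimal sketch on toy data of why filling 0 ratings with `np.nan` is reasonable: since `.mean()` skips NaN but not 0, leaving the 0 in drags the per-recipe average down. The `recipe_id` values here are hypothetical:

```python
import pandas as pd
import numpy as np

toy = pd.DataFrame({'recipe_id': [1, 1, 1, 2],
                    'rating':    [5, 4, 0, 3]})  # 0 = reviewer left no rating

# keeping the 0 biases recipe 1's average: (5 + 4 + 0) / 3 = 3.0
biased = toy.groupby('recipe_id')['rating'].mean()

# replacing 0 with NaN lets groupby-mean skip it: (5 + 4) / 2 = 4.5
toy['rating'] = toy['rating'].replace(0, np.nan)
avg = toy.groupby('recipe_id')['rating'].mean()
print(biased.loc[1], avg.loc[1])
```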
Transformation
- Some columns, like `nutrition`, contain values that look like lists but are actually strings. We parsed these strings into a separate column for every unique value in those lists.
- Convert `steps`, `ingredients`, and `tags` to lists.
- Convert `date` and `submitted` to Timestamp objects and rename them `review_date` and `recipe_date`.
- Convert column types.
- Drop `id` (a duplicate of `recipe_id`).
- Replace 'nan' strings with np.NaN.
Type Logic
- String: [name, contributor_id, user_id, recipe_id] are identifiers; no mathematical operations can be performed on them (qualitative discrete)
  - `name` is the name of the recipe
  - `contributor_id` is the author id of the recipe (shape=7157)
  - `recipe_id` is the id of the recipe (shape=25287)
  - `id` from the original dataframe is also the id of the recipe, dropped after merging
  - `user_id` is the id of the reviewer (shape=8402)
- List: [tags, steps, description, ingredients, review] are qualitative; no mathematical operations (qualitative discrete)
- int: [n_steps, minutes, n_ingredients, rating] are quantitative; mathematical operations allowed (quantitative discrete)
- float: [avg_rating, calories, total_fat, sugar, sodium, protein, sat_fat, carbs] are quantitative; mathematical operations allowed (quantitative continuous)
- Timestamp: [recipe_date, review_date] are quantitative; mathematical operations allowed (quantitative continuous)
Below is the full implementation: `initial`, which performs the merge-related cleaning, followed by `transform_df`, which carries out the transformations described above.
def initial(df):
    '''Initial cleaning and merging of the two dataframes; adds average ratings'''
    # treat ratings of 0 as missing
    df['rating'] = df['rating'].apply(lambda x: np.NaN if x == 0 else x)
    # recipe_id is not unique, so group to get per-recipe averages
    avg = df.groupby('recipe_id')[['rating']].mean().rename(columns={'rating': 'avg_rating'})
    df = df.merge(avg, how='left', left_on='recipe_id', right_index=True)
    return df
def transform_df(df):
    '''Transform nutrition into one column per category,
    convert tags, steps, and ingredients to lists,
    convert submission dates to Timestamp objects,
    convert types,
    and replace 'nan' strings with np.NaN'''
    # Split nutrition into one column per category
    data = df['nutrition'].str.strip('[]').str.split(',').to_list()
    name = {0: 'calories', 1: 'total_fat', 2: 'sugar', 3: 'sodium',
            4: 'protein', 5: 'sat_fat', 6: 'carbs'}
    new = pd.DataFrame(data).rename(columns=name)
    df = df.merge(new, how='inner', right_index=True, left_index=True)
    df = df.drop(columns=['nutrition'])

    # Convert stringified lists to actual lists
    def convert_to_list(text):
        return text.strip('[]').replace("'", '').split(', ')

    df['tags'] = df['tags'].apply(convert_to_list)
    df['ingredients'] = df['ingredients'].apply(convert_to_list)
    # some steps are long sentences without quotes; some whitespace may need handling
    df['steps'] = df['steps'].apply(convert_to_list)

    # Submission dates to Timestamp objects
    fmt = '%Y-%m-%d'
    df['submitted'] = pd.to_datetime(df['submitted'], format=fmt)
    df['date'] = pd.to_datetime(df['date'], format=fmt)

    # Drop the redundant id column and rename the date columns
    df = df.drop(columns=['id']).rename(columns={'submitted': 'recipe_date', 'date': 'review_date'})

    # Convert data types
    nutrition_cols = ['calories', 'total_fat', 'sugar', 'sodium', 'protein', 'sat_fat', 'carbs']
    df[nutrition_cols] = df[nutrition_cols].astype(float)
    df[['user_id', 'recipe_id', 'contributor_id']] = df[['user_id', 'recipe_id', 'contributor_id']].astype(str)

    # Replace 'nan' strings with np.NaN
    for col in df.select_dtypes(include='object'):
        df[col] = df[col].apply(lambda x: np.NaN if x == 'nan' else x)
    return df
merged = recipes.merge(interactions, how='left', left_on='id', right_on='recipe_id')
cleaned = (merged
.pipe(initial)
.pipe(transform_df))
display_df(cleaned)
| | name | minutes | contributor_id | recipe_date | ... | sodium | protein | sat_fat | carbs |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 brownies in the world best ever | 40 | 985201 | 2008-10-27 | ... | 3.0 | 3.0 | 19.0 | 6.0 |
| 1 | 1 in canada chocolate chip cookies | 45 | 1848091 | 2011-04-11 | ... | 22.0 | 13.0 | 51.0 | 26.0 |
| 2 | 412 broccoli casserole | 40 | 50969 | 2008-05-30 | ... | 32.0 | 22.0 | 36.0 | 3.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 234426 | cookies by design sugar shortbread cookies | 20 | 506822 | 2008-04-15 | ... | 4.0 | 4.0 | 11.0 | 6.0 |
| 234427 | cookies by design sugar shortbread cookies | 20 | 506822 | 2008-04-15 | ... | 4.0 | 4.0 | 11.0 | 6.0 |
| 234428 | cookies by design sugar shortbread cookies | 20 | 506822 | 2008-04-15 | ... | 4.0 | 4.0 | 11.0 | 6.0 |
234429 rows × 23 columns
The functions below are used later when we need to group by the recipe_id column or the user_id column for different purposes. The aggregation for each column differs according to what the column is needed for later in the modeling process.
def group_recipe(df):
    '''Group by unique recipe_id, averaging numeric columns and collecting list columns'''
    to_list = lambda x: list(x)
    check_dict = {'minutes': 'mean', 'n_steps': 'mean', 'n_ingredients': 'mean',
                  'avg_rating': 'mean', 'rating': 'mean', 'calories': 'mean',
                  'total_fat': 'mean', 'sugar': 'mean', 'sodium': 'mean',
                  'protein': 'mean', 'sat_fat': 'mean', 'carbs': 'mean',
                  'steps': 'first', 'name': 'first', 'description': 'first',
                  'ingredients': to_list, 'user_id': to_list, 'contributor_id': to_list,
                  'review_date': to_list, 'review': to_list, 'recipe_date': to_list,
                  'tags': lambda x: list(chain.from_iterable(x))}
    return df.groupby('recipe_id').agg(check_dict)
def group_user(df):
    '''Group by unique user_id, concatenating all steps/names/tags of recipes and averaging ratings given'''
    cols = ['steps', 'rating', 'name', 'tags', 'minutes', 'calories',
            'description', 'n_ingredients', 'ingredients', 'contributor_id', 'review']
    return (df
            # select columns with a list (tuple-style indexing is an error in modern pandas)
            .groupby('user_id')[cols]
            .agg({'steps': lambda x: list(chain.from_iterable(x)),
                  'name': lambda x: list(x),
                  'tags': lambda x: list(chain.from_iterable(x)),
                  'rating': 'mean',
                  'minutes': 'mean',
                  'calories': 'mean',
                  'description': lambda x: list(x),
                  'n_ingredients': 'mean',
                  'ingredients': lambda x: list(chain.from_iterable(x)),
                  'contributor_id': lambda x: list(x),
                  'review': lambda x: list(x),
                  })
            )
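As a quick sanity check of the aggregation pattern, here is a toy run of the same flatten-and-average logic used in `group_user` (the data and the column subset are hypothetical, kept small for readability):

```python
import pandas as pd
from itertools import chain

# hypothetical per-review rows: two reviews from u1, one from u2
toy = pd.DataFrame({'user_id': ['u1', 'u1', 'u2'],
                    'tags': [['easy', 'vegan'], ['quick'], ['easy']],
                    'rating': [5, 4, 3]})

# same pattern as group_user: flatten list columns, average numeric columns
grouped = toy.groupby('user_id').agg(
    {'tags': lambda x: list(chain.from_iterable(x)),
     'rating': 'mean'})
print(grouped)
```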
Univariate & Bivariate Analysis
After data cleaning, let's draw some graphs to see what kind of data we are dealing with.
px.violin(cleaned, x=['sodium','calories','minutes'])